
Extracting Patents that cite Publications from a chosen Research Organization

This tutorial shows how to extract and analyse patents data linked to a selected research organization, using the Dimensions Analytics API.

Load libraries and log in

[1]:
# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

# import all libraries and login
!pip install dimcli plotly tqdm -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
import os
import sys
import time
import json
import pandas as pd
from pandas import json_normalize  # note: pandas.io.json.json_normalize is deprecated in recent pandas versions
from tqdm.notebook import tqdm as progressbar
#
# charts lib
import plotly.express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
DimCli v0.6.2.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

A couple of utility functions to simplify exporting CSV files to a selected folder

[3]:
#
# data-saving utils
#
DATAFOLDER = "extraction1"
#
if not os.path.exists(DATAFOLDER):
  !mkdir $DATAFOLDER
  print(f"==\nCreated data folder: {DATAFOLDER}/")
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")

Choose a GRID Research Organization

For the purpose of this exercise, we are going to use grid.89170.37. Feel free to change the parameters below, e.g. by choosing another GRID organization.

[2]:

GRIDID = "grid.89170.37" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions &#x29c9;</a>'.format(dimensions_url(GRIDID))))

#@markdown ---

1 - Prerequisite: Extracting Publications Data

By looking at the Dimensions API data model, we can see that the connection between Patents and Publications is represented by a directed arrow going from Patents to Publications: that means we should look for patents records whose publication_ids field contains references to the GRID publications we are interested in.

Hence, we need to:

  • a) extract all publications linked to one (or more) GRID IDs, and

  • b) use these publications to extract patents referencing them.

[5]:
# Get full list of publications linked to this organization for the selected time frame

q = f"""search publications
        where research_orgs.id="{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        return publications[basics+category_for+times_cited]"""

pubs_json = dsl.query_iterative(q)
pubs = pubs_json.as_dataframe()

# save the data
save_as_csv(pubs, f"pubs_{GRIDID}")
1000 / 17204
2000 / 17204
3000 / 17204
4000 / 17204
5000 / 17204
6000 / 17204
7000 / 17204
8000 / 17204
9000 / 17204
10000 / 17204
11000 / 17204
12000 / 17204
13000 / 17204
14000 / 17204
15000 / 17204
16000 / 17204
17000 / 17204
17204 / 17204
===
Saved:  extraction1/pubs_grid.89170.37.csv

How many publications per year?

Let’s have a quick look at the publication volume per year.

[6]:
px.histogram(pubs, x="year", y="id", color="type", barmode="group", title=f"Publications by year from {GRIDID}")

What are the main subject areas?

We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.

This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic information about the publication each row relates to (e.g. journal, title, publication ID, year), so as to make the data easier to analyse.
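A minimal pure-Python sketch of this ‘exploding’ step (flattening one nested list per row into its own long table), using hypothetical record shapes that mirror the API's category_for data:

```python
def explode_records(rows, record_path, meta, record_prefix=""):
    """Flatten one nested list-of-dicts per row into a long table (one dict per nested record)."""
    out = []
    for row in rows:
        for rec in row.get(record_path) or []:
            flat = {record_prefix + key: value for key, value in rec.items()}
            for field in meta:  # carry over some parent-level metadata
                flat[field] = row.get(field)
            out.append(flat)
    return out

rows = [
    {"id": "pub.1", "year": 2003,
     "category_for": [{"id": "2921", "name": "0912 Materials Engineering"}]},
    {"id": "pub.2", "year": 2005, "category_for": []},  # no categories: contributes no rows
]
explode_records(rows, "category_for", meta=["id", "year"], record_prefix="for_")
# → [{'for_id': '2921', 'for_name': '0912 Materials Engineering', 'id': 'pub.1', 'year': 2003}]
```

pandas' json_normalize additionally supports nested meta paths such as ["journal", "title"]; the sketch keeps only the parts this tutorial relies on.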

[7]:
# ensure key exists in all rows (even if empty)
normalize_key("category_for", pubs_json.publications)
normalize_key("journal", pubs_json.publications)
# explode subjects into separate table
pubs_subjects = json_normalize(pubs_json.publications, record_path=['category_for'],
                               meta=["id", "type", ["journal", "title"], "year"],
                               errors='ignore', record_prefix='for_')
# add a new column: category name without digits for better readability
pubs_subjects['topic'] = pubs_subjects['for_name'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
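The digit-stripping lambda above can be factored into a small named helper; this sketch adds an extra .strip() to drop the leading space that removing the FOR code leaves behind (the notebook's lambda keeps it):

```python
def strip_digits(text: str) -> str:
    """Remove all digits from a FOR category label, e.g. '0912 Materials Engineering'."""
    return ''.join(ch for ch in text if not ch.isdigit()).strip()

strip_digits("0912 Materials Engineering")
# → 'Materials Engineering'
```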

Now we can build a scatter plot that shows the number and distribution of categories over the years.

[22]:
px.scatter(pubs_subjects, x="year", y="topic", color="type",
           hover_name="for_name",
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top publication subjects for {GRIDID} (marginal subplots = X/Y totals)")

2 - Extracting Patents linked to Publications

In this section we extract all patents linked to the publications dataset previously created. The steps are the following:

  • we loop over the publication IDs and create patents queries, via the publication_ids field of patents

  • we collate all patents data, remove duplicates and save the results

  • finally, we count patents per publication and enrich the original publications dataset with these numbers
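The batching in the first step uses dimcli's chunks_of helper; conceptually it just slices a list into fixed-size pieces, which keeps each patents query below the API's filter-size limits. A standalone sketch of that behaviour:

```python
def chunks_of(items, size):
    """Yield consecutive slices of at most `size` elements from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

list(chunks_of(list(range(10)), 4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```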

[24]:
#
# the main query
#
q = """search patents where publication_ids in {}
  return patents[basics+publication_ids+FOR]"""


from dimcli.shortcuts import chunks_of

#
# let's loop through all grants IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
patents_json = []
BATCHSIZE = 400
VERBOSE = False # set to True to see patents extraction logs
pubsids = pubs['id']

for chunk in progressbar(list(chunks_of(list(pubsids), BATCHSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    patents_json += data.patents
    time.sleep(1)


#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame().from_dict(patents_json)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
# save
save_as_csv(patents, f"patents_{GRIDID}")
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['publication_ids'] = patents['publication_ids'].apply(lambda x: ','.join(map(str, x)))


#
# count patents per publication and enrich the original dataset
#
def patents_per_pub(pubid):
  global patents
  # regex=False: match the ID literally (a '.' in e.g. 'pub.123' would otherwise act as a regex wildcard)
  return patents[patents['publication_ids'].str.contains(pubid, regex=False)]

print("===\nCounting patents per publication...")

l = []
for x in progressbar(pubsids):
  l.append(len(patents_per_pub(x)))

#
# enrich and save the publications data
#
pubs['patents'] = l
save_as_csv(pubs, f"pubs_{GRIDID}_enriched_patents")

===
Extracting patents data ...

Patents found:  4661
Unique Patents found:  3968
===
Saved:  extraction1/patents_grid.89170.37.csv
===
Counting patents per publication...

===
Saved:  extraction1/pubs_grid.89170.37_enriched_patents.csv
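The per-publication counting above matches IDs as substrings of the comma-joined publication_ids column; an alternative sketch (with hypothetical toy data) counts by exact membership in the original ID lists, which avoids any risk of one ID matching as a prefix of another:

```python
def count_citing_patents(pub_id, patents):
    """Count patents whose publication_ids list contains pub_id (exact match)."""
    return sum(1 for p in patents if pub_id in (p.get("publication_ids") or []))

toy_patents = [
    {"id": "US-1", "publication_ids": ["pub.10", "pub.11"]},
    {"id": "US-2", "publication_ids": ["pub.11"]},
    {"id": "US-3", "publication_ids": None},  # some patents cite no publications
]
count_citing_patents("pub.11", toy_patents)   # → 2
count_citing_patents("pub.1", toy_patents)    # → 0 (no substring false positive)
```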

A quick look at the data

[25]:
# display top 3 rows
patents.head(3)
[25]:
FOR assignee_names assignees filing_status granted_year id inventor_names publication_date publication_ids times_cited title year
0 NaN [VIB VZW, UNIV GENT, UNIV BRUSSEL VRIJE] [{'id': 'grid.5342.0', 'city_name': 'Ghent', '... Application NaN WO-2017153345-A1 [TAVERNIER, JAN, VAN DER HEYDEN, José, DEVOOGD... 2017-09-14 pub.1019496911,pub.1006473292,pub.1067588797,p... NaN CD20 BINDING AGENTS AND USES THEREOF 2017
1 [{'id': '2921', 'name': '0912 Materials Engine... [Taiwan Semiconductor Manufacturing Co (TSMC) ... [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... Application NaN US-20160240719-A1 [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... 2016-08-18 pub.1021290825,pub.1034244510,pub.1022541167 11.0 Semiconductor Devices Comprising 2D-Materials ... 2015
2 [{'id': '2921', 'name': '0912 Materials Engine... [Taiwan Semiconductor Manufacturing Co (TSMC) ... [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... Grant 2018.0 US-9859115-B2 [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... 2018-01-02 pub.1073831920,pub.1034244510,pub.1021290825,p... NaN Semiconductor devices comprising 2D-materials ... 2015

NOTE The publications dataset now includes patents info too:

[26]:
pubs.sort_values("patents", ascending=False).head(3)
[26]:
author_affiliations category_for id issue journal.id journal.title pages times_cited title type volume year patents
13620 [[{'first_name': 'Leonidas C', 'last_name': 'P... [{'id': '2921', 'name': '0912 Materials Engine... pub.1008622444 2-3 jour.1047200 Organic Electronics 113-121 55.0 High efficiency molecular organic light-emitti... article 4 2003 237
16384 [[{'first_name': 'M.', 'last_name': 'Friedman'... [{'id': '2921', 'name': '0912 Materials Engine... pub.1061127158 2 jour.1123356 IEEE Transactions on Microwave Theory and Tech... 341-348 5.0 Low-Loss RF Transport Over Long Distances article 49 2001 150
11680 [[{'first_name': 'Igor L.', 'last_name': 'Medi... [{'id': '2202', 'name': '02 Physical Sciences'... pub.1012397724 6 jour.1031408 Nature Materials 435-446 4199.0 Quantum dot bioconjugates for imaging, labelli... article 4 2005 90

3 - Patents Data Analysis

Now that we have extracted all the data we need, let’s start exploring them by building a few visualizations.

How many patents per year?

[27]:
px.histogram(patents, x="year", y="id", color="filing_status",
             barmode="group",
             title=f"Patents referencing publications from {GRIDID} - by year")

Who is filing the patents?

This can be done by looking at the assignees field of patents. Since the field contains nested information, first we need to extract it into its own table (similarly to what we’ve done above with publications categories).

[33]:
# ensure the key exists in all rows (even if empty)
normalize_key('assignees', patents_json)
# explode assignees into separate table
patents_assignees = json_normalize(patents_json,
                                   record_path=['assignees'],
                                   meta=['id', 'year', 'title'],
                                   meta_prefix="patent_")
top_assignees = patents_assignees.groupby(['name', 'country_name'],  as_index=False).count().sort_values(by="patent_id", ascending=False)
# preview the data: note that the patent_id column holds the COUNT of patents
top_assignees[['name', 'country_name', 'patent_id']].head()
[33]:
name country_name patent_id
549 United States Department of the Navy United States 550
552 Universal Display Corporation United States 232
5 AT&T (United States) United States 155
287 Massachusetts Institute of Technology United States 121
565 University of California System United States 111
[30]:
px.bar(top_assignees, x="name", y="patent_id",
       hover_name="name", color="country_name",
       height=900,
       title=f"Top Assignees for patents referencing publications from {GRIDID}")
[44]:
px.scatter(patents_assignees,  x="name", y="country_name",
           color="patent_year", hover_name="name",
           height = 1000,
           hover_data=["id", "patent_id"],  marginal_y="histogram",
           title=f"Assignees for patents referencing publications from {GRIDID} - Yearly breakdown")

What are the publications most frequently referenced in patents?

[36]:
pubs_cited = pubs.query("patents > 0 ").sort_values('patents', ascending=False).copy()
pubs_cited.head()
[36]:
author_affiliations category_for id issue journal.id journal.title pages times_cited title type volume year patents
13620 [[{'first_name': 'Leonidas C', 'last_name': 'P... [{'id': '2921', 'name': '0912 Materials Engine... pub.1008622444 2-3 jour.1047200 Organic Electronics 113-121 55.0 High efficiency molecular organic light-emitti... article 4 2003 237
16384 [[{'first_name': 'M.', 'last_name': 'Friedman'... [{'id': '2921', 'name': '0912 Materials Engine... pub.1061127158 2 jour.1123356 IEEE Transactions on Microwave Theory and Tech... 341-348 5.0 Low-Loss RF Transport Over Long Distances article 49 2001 150
11680 [[{'first_name': 'Igor L.', 'last_name': 'Medi... [{'id': '2202', 'name': '02 Physical Sciences'... pub.1012397724 6 jour.1031408 Nature Materials 435-446 4199.0 Quantum dot bioconjugates for imaging, labelli... article 4 2005 90
14084 [[{'first_name': 'Jyoti K.', 'last_name': 'Jai... [{'id': '2581', 'name': '0601 Biochemistry and... pub.1036950132 1 jour.1115214 Nature Biotechnology 47-51 1466.0 Long-term multiple color imaging of live cells... article 21 2003 69
11398 [[{'first_name': 'Haizhou', 'last_name': 'Yin'... [{'id': '2921', 'name': '0912 Materials Engine... pub.1061591798 10 jour.1019990 IEEE Transactions on Electron Devices 2207-2214 14.0 Ultrathin Strained-SOI by Stress Balance on Co... article 52 2005 67
[37]:
px.bar(pubs_cited[:1000], color="type",
       x="year", y="patents",
       hover_name="title",  hover_data=["journal.title"],
       title=f"Top Publications from {GRIDID} mentioned in patents, by year of publication")

What are the main subject areas of referenced publications?

[38]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_subjects_cited = pubs_subjects[pubs_subjects['id'].isin(citedids)]
[39]:
px.scatter(pubs_subjects_cited, x="year", y="topic", color="type",
           hover_name="for_name",
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top {THRESHOLD_PUBS} {GRIDID} publications cited by patents - by subject area")

Is there a correlation from publication citations to patents citations?

Note: if the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), that is indicative of a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.
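To quantify the correlation rather than eyeballing the scatterplot, one could compute the Pearson coefficient, e.g. via pubs['patents'].corr(pubs['times_cited']) in pandas. The underlying formula, sketched in plain Python:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

pearson([1, 2, 3, 4], [2, 4, 6, 8])   # ≈ 1.0 (perfect positive correlation)
```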

[40]:
px.scatter(pubs, x="patents", y="times_cited",
           title=f"Citations of {GRIDID} publications from publications VS from patents")

Conclusions

In this Dimensions Analytics API tutorial we have seen how, starting from a GRID organization, it is possible to extract

    1. publications from authors associated with this organization

    2. patents citing those publications (from any organization)

We have also done a basic analysis of the citing patents dataset, using fields like citation year, assignees, etc.

This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application!



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Check out the associated GitHub repository for examples, the source code of these tutorials and much more.
